11 research outputs found

    Singular value computations on the AP1000 array computer

    No full text
    The increasing popularity of singular value decomposition algorithms, used as a tool in many areas of science and engineering, demands rapid development of fast and reliable implementations. These implementations are no longer confined to single-processor environments, as more and more parallel computers become available on the market. Consequently, software must often be re-implemented efficiently on new parallel architectures. In this thesis we show, using a singular value decomposition algorithm as an example, how this change of working environment can be accomplished with non-trivial gains in performance. We present several optimisation techniques and their impact on the algorithm's performance at all levels of the parallel memory hierarchy (register, cache, main memory and external processor memory). The central principle in all of the optimisations presented herein is to increase the number of columns (column segments) held in each level of the memory hierarchy and thereby increase the data reuse factors. The techniques used in the optimisations for the parallel memory hierarchy are rectangular processor configuration, partitioning, and four-column rotation. In the rectangular processor configuration technique, the data are mapped onto a rectangular network of processors instead of a linear one. This improves the communication and cache performance such that, on average, we reduced the execution time by a factor of 2 and, in the case of long column segments, by a factor of 5. The partitioning technique involves rearranging the data and the order of computations in the cells, which increases the cache hit ratio for large matrices. For relatively modest cache performance improvements of 2 to 5%, we achieved significant reductions in execution time of 10 to 20%. The four-column rotation technique improves performance through better register reuse. When a large number of columns is stored per processor, this technique gave a 2 to 10% improvement in execution time over the classic two-column rotation. Apart from the optimisations on the memory hierarchy levels, we present several floating-point optimisations to the algorithm itself which can be applied on any architecture. The main ideas behind these optimisations are reducing the number of floating-point instructions executed and balancing the floating-point operations. This was accomplished by reshaping the relevant parts of the code to use the AP1000's processor architecture (SPARC) to its full potential. After combining all of the optimisations, we achieved a sustained 60% reduction in execution time, which corresponds to a 2.5-fold speedup. In the cases where long columns of the input matrix were used, we achieved a nearly 5-fold reduction in execution time without adversely affecting the accuracy of the singular values, while maintaining the quadratic convergence of the algorithm. The algorithm was implemented on the Fujitsu AP1000 Array Multiprocessor, but all of the optimisations described can easily be applied to any MIMD architecture with a mesh or hypercube topology, and all but one can also be applied to register-cache uniprocessors. Despite many changes in the structure of the algorithm, we found that the convergence was not adversely affected and that the accuracy of the orthogonalisation was no worse than for the uniprocessor implementation of the noted SVD algorithm.
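
    The classic two-column rotation that the four-column technique improves on is the Hestenes-Jacobi step: a pair of column segments is orthogonalised by a plane rotation. A minimal uniprocessor sketch in C follows (the variable names and column-major layout are illustrative assumptions, not the thesis code):

        #include <math.h>
        #include <stddef.h>

        /* Orthogonalise columns i and j (each of length m) of the
         * column-major matrix a: one Hestenes-Jacobi two-column
         * rotation.  Illustrative sketch, not the AP1000 code. */
        void rotate_pair(double *a, int m, int i, int j)
        {
            double *u = a + (size_t)i * m;
            double *v = a + (size_t)j * m;
            double p = 0.0, q = 0.0, r = 0.0;
            for (int k = 0; k < m; k++) {       /* inner products */
                p += u[k] * v[k];
                q += u[k] * u[k];
                r += v[k] * v[k];
            }
            if (p == 0.0) return;               /* already orthogonal */
            double theta = (r - q) / (2.0 * p);
            double t = (theta >= 0.0 ? 1.0 : -1.0)
                       / (fabs(theta) + sqrt(1.0 + theta * theta));
            double c = 1.0 / sqrt(1.0 + t * t);
            double s = c * t;
            for (int k = 0; k < m; k++) {       /* apply the rotation */
                double uk = u[k], vk = v[k];
                u[k] = c * uk - s * vk;
                v[k] = s * uk + c * vk;
            }
        }

    Once repeated sweeps have driven all column pairs to orthogonality, the singular values are the final column norms; register-level variants such as the four-column rotation apply this step to more columns at once so loaded operands are reused.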

    Optimisations for the memory hierarchy of a Singular Value Decomposition Algorithm implemented on the MIMD Architecture

    No full text
    The increasing popularity of Singular Value Decomposition algorithms, used in real-time signal processing, demands rapid development of fast and reliable implementations. This paper shows several modifications to the Jacobi-like parallel algorithm for Singular Value Decomposition (SVD) and their impact on the algorithm's performance. In particular, optimisations for the parallel memory hierarchy (register, cache, main memory and external processor memory levels) can dramatically increase the performance of the Hestenes SVD algorithm. The central principle in all of the optimisations presented herein is to increase the number of columns (column segments) held in each level of the memory hierarchy. The algorithm was implemented on the Fujitsu AP1000 Array Multiprocessor, but all of the optimisations described can easily be applied to any MIMD architecture with a mesh or hypercube topology, and all but one can also be applied to register-cache uniprocessors.
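
    The central principle can be illustrated by a blocked sweep: a block of b columns is kept resident at a given memory level and all pairwise rotations inside the block are performed before moving on, so each loaded column segment is reused many times. A hedged C sketch (the block size b and pairing order are illustrative; rotate_pair() is the two-column rotation step sketched under the thesis entry above):

        /* Rotation step from the Hestenes algorithm, sketched above. */
        extern void rotate_pair(double *a, int m, int i, int j);

        /* Sweep over blocks of b columns of the m x n column-major
         * matrix a.  Pairs spanning two blocks, handled in the real
         * algorithm as blocks are exchanged between processors or
         * memory levels, are omitted from this sketch. */
        void blocked_sweep(double *a, int m, int n, int b)
        {
            for (int lo = 0; lo < n; lo += b) {
                int hi = (lo + b < n) ? lo + b : n;
                for (int i = lo; i < hi - 1; i++)   /* intra-block pairs */
                    for (int j = i + 1; j < hi; j++)
                        rotate_pair(a, m, i, j);
            }
        }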

    Performance Analysis of KDD Applications using Hardware Event Counters

    No full text
    Modern processors and computer systems are designed to be efficient and achieve high performance with applications that have regular memory access patterns. For example, dense linear algebra routines can be implemented to achieve near-peak performance. While such routines have traditionally formed the core of many scientific and engineering applications, commercial workloads like database and web servers, or decision support systems (data warehouses and data mining), are among the fastest growing segments in the high-performance computing market. Many of these commercial applications are characterised by complex codes and irregular memory access patterns, which often result in decreased performance. Due to their complexity and the lack of source code, performance analysis of commercial applications is not an easy task. Hardware performance counters allow the acquisition of the low-level, reliable data necessary for detailed analysis of program behaviour. In this paper we describe experiments conducted with various KDD applications on an UltraSPARC III platform and present first results.
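
    A minimal example of this measurement style is counting events around a code region with the portable PAPI library; this is an assumption for illustration (the preset events, the placeholder kernel() and PAPI itself are not taken from the paper, which could equally use the native Solaris counter interface):

        #include <stdio.h>
        #include <stdlib.h>
        #include <papi.h>

        volatile double sink;

        static void kernel(void)                /* placeholder workload */
        {
            double s = 0.0;
            for (int i = 0; i < 1000000; i++)
                s += i * 0.5;
            sink = s;
        }

        int main(void)
        {
            int events[2] = { PAPI_TOT_INS, PAPI_L2_TCM };
            long long counts[2];

            if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
                exit(1);
            if (PAPI_start_counters(events, 2) != PAPI_OK)
                exit(1);
            kernel();                           /* region under analysis */
            if (PAPI_stop_counters(counts, 2) != PAPI_OK)
                exit(1);
            printf("instructions: %lld  L2 misses: %lld\n",
                   counts[0], counts[1]);
            return 0;
        }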

    How fast is -fast? Performance analysis of KDD applications using hardware performance counters on UltraSPARC-III

    No full text
    Modern processors and computer systems are designed to be efficient and achieve high performance with applications that have regular memory access patterns. For example, dense linear algebra routines can be implemented to achieve near-peak performance. While such routines have traditionally formed the core of many scientific and engineering applications, commercial workloads like database and web servers, or decision support systems (data warehouses and data mining), are among the fastest growing market segments on high-performance computing platforms. Many of these commercial applications are characterised by more complex codes and irregular memory access patterns, which often result in decreased performance. Due to their complexity and the lack of source code, performance analysis of commercial applications is not an easy task. Hardware performance counters allow detailed analysis of program behaviour, such as the number of instructions of various types, memory and cache accesses, hit and miss rates, or branch mispredictions. In this paper we describe experiments conducted with various KDD applications on an UltraSPARC-III platform, present the results, and compare these applications with an optimised dense matrix-matrix multiplication. We focus on compiler optimisations using the -fast flag and discuss differences between unoptimised and optimised codes.
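
    The regular-access baseline, a dense matrix-matrix multiplication, is in its simplest form a triple loop like the sketch below (the size N and the loop order are illustrative choices, not the paper's code). Building it with the Sun compiler as cc -fast mm.c versus plain cc mm.c exposes the kind of optimised-versus-unoptimised gap the paper measures:

        #define N 512

        static double a[N][N], b[N][N], c[N][N];

        /* ikj loop order: the innermost loop walks b and c with
         * unit stride, the regular access pattern the paper
         * contrasts with irregular KDD codes. */
        void matmul(void)
        {
            for (int i = 0; i < N; i++)
                for (int k = 0; k < N; k++) {
                    double aik = a[i][k];
                    for (int j = 0; j < N; j++)
                        c[i][j] += aik * b[k][j];
                }
        }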

    Implementation Aspects of a SPARC V9 Complete Machine Simulator

    No full text
    In this paper we present work in progress on the development of a complete machine simulator for the UltraSPARC, an implementation of the SPARC V9 architecture. The complexity of the UltraSPARC ISA presents many challenges in developing a reliable and yet reasonably efficient implementation of such a simulator. Our implementation includes a heavily object-oriented design for the simulator modules and infrastructure, caching of repeated computations for performance, an OS (system call) emulation mode, and a variety of testing strategies. An ultimate and critical goal in constructing such an artifact is to successfully boot an existing operating system on it; we describe the techniques implemented so far and outline the remaining work and issues involved in achieving this goal.
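
    One common form of caching repeated computations in a machine simulator is memoising instruction decode, so that a hot loop is decoded once and re-dispatched cheaply thereafter. A hypothetical C sketch (the paper's simulator is object-oriented, and every name here is a placeholder rather than its actual design):

        #include <stdint.h>
        #include <stddef.h>

        typedef void (*handler_fn)(void *cpu, uint32_t insn);

        /* Full decoder, assumed to exist elsewhere in the simulator. */
        extern handler_fn decode(uint32_t insn);

        typedef struct {
            uint64_t   pc;        /* address the entry was decoded from */
            uint32_t   insn;      /* raw instruction word */
            handler_fn execute;   /* cached result of the decode */
        } decoded_t;

        #define SLOTS 4096        /* direct-mapped, power of two */
        static decoded_t dcache[SLOTS];

        handler_fn fetch_decoded(uint64_t pc, uint32_t insn)
        {
            decoded_t *e = &dcache[(pc >> 2) & (SLOTS - 1)];
            if (e->execute == NULL || e->pc != pc || e->insn != insn) {
                e->pc = pc;                  /* miss: decode and fill */
                e->insn = insn;
                e->execute = decode(insn);
            }
            return e->execute;               /* hit: reuse the decode */
        }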

    Theory and simulations of confined polymer fluids

    Full text link

    Implementation of the software systems for the SkyMapper automated survey telescope

    No full text
    This paper describes the software systems implemented for SkyMapper, a wide-field, automated survey telescope. The telescope is expected to operate completely unmanned and in an environment where failures will remain unattended for several days. A failure analysis was undertaken and the control system extended to cope with subsystem failures, protecting vulnerable detectors and electronics from damage. The data acquisition and control software acquires and stores 512 MB of image data every twenty seconds. As a consequence of this short duty cycle, the preparation of the hardware subsystems for successive images is undertaken in parallel with the imager readout. A science data pipeline will catalogue objects in the images to produce the Southern Sky Survey.
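
    The overlap of readout with preparation can be sketched with two threads, one draining the imager while the other readies the hardware for the next field. A hypothetical sketch in C with POSIX threads (all of the subsystem calls are placeholder names, not SkyMapper's actual interfaces):

        #include <pthread.h>

        /* Placeholder subsystem operations, assumed for illustration. */
        extern void slew_telescope(void *obs);
        extern void configure_instrument(void *obs);
        extern void read_out_imager(void *obs);   /* ~512 MB per image */
        extern void store_image(void *obs);

        static void *prepare_next(void *obs)
        {
            slew_telescope(obs);          /* point at the next field */
            configure_instrument(obs);
            return NULL;
        }

        void exposure_cycle(void *obs)
        {
            pthread_t prep;
            /* Readout and next-image preparation run in parallel
             * to fit inside the twenty-second duty cycle. */
            pthread_create(&prep, NULL, prepare_next, obs);
            read_out_imager(obs);
            pthread_join(prep, NULL);
            store_image(obs);             /* hand off to the data pipeline */
        }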